Predictive Modelling Of Mortality Rates for Diabetes: An Analysis Of Risk Factors
Diabetes is a chronic disease affecting millions globally and a leading cause of death. Understanding mortality trends and predicting future mortality rates is crucial for public health planning and intervention strategies. Our objective is to identify critical factors that contribute to mortality from diabetes and develop a suitable model for predicting future mortality rates. The outcomes of this study could inform policies and interventions aimed at reducing the burden of diabetes-related mortality.
Resource: CDC WONDER
This project analyzes mortality rates for diabetes in the United States from 2015 to 2020 and employs various predictive methods in Statistics and Machine Learning to forecast future mortality rates.
Variables:
According to the age distribution, the older the age, the higher the mortality rate of diabetics. the greatest mortality rate was recorded in the 80-84 age group, reaching 5963 per 100,000. More surprisingly, infants younger than one year of age also have a high mortality rate, ranking seventh out of 18 age groups, at 570 per 100,000. The lowest mortality rate was in the 5-9 age group, with a rate of 13 per 100,000.
In terms of ethnic distribution, the death rate for non-Hispanics and Latinos is higher than that for Hispanics, almost twice as high. American Indians or Alaska Natives have the highest mortality rate, with a rate of approximately 1061 per 100,000. Asian and Pacific Islanders have the lowest mortality rate, with a rate of approximately 364 per 100,000. By gender distribution, males have a higher mortality rate than females.
The geographical distribution is clear. According to the regional map, the North Central and Southern regions have the highest mortality rates, even almost identical, at around 729, but when it comes to mortality rates per state, seven of the top ten states are from the Southern region, while the top six states are all from the South.
Since 2015, through 2019, the diabetes death rate has steadily increased in each year based on trends, but not by much. But the mortality rate in 2020 is clearly and significantly higher, and this reason is supposedly related to the outbreak of corona virus in March 2020.
Negative binomial regression is used to predict the count of deaths instead of Poisson regression. As checked, in this dataset, there is a huge gap between the mean and variance of the number of deaths.
Negative binomial regression assumes that the count data follows a negative binomial distribution, and the predictor variables have a linear relationship with the logarithm of the expected count.
Link Function:
\(\log{\mbox{deaths}} = \mbox{Intercept} + b_1x_1+b_2x_2+\cdot + b_kx_k\)
| Models | Predictors |
|---|---|
| Model 1 | Sex |
| Model 2 | Race |
| Model 3 | Age Group |
| Model 4 | Age Group, Race |
| Model 5 | Race, Sex |
| Model 6 | Age Group, Sex |
| Model 7 | Age Group, Race, Sex |
| Model 8 | Age Group, Race, Sex, Ethnicity |
| Model 9 | Age Group, Race, Sex, Ethnicity, Year |
| Model 10 | Age Group, Race, Sex, Ethnicity, State |
| Model 11 | Age Group, Race, Sex, Ethnicity, Year, State |
| d.f. | dev | disp | AIC | 2Llik | |
|---|---|---|---|---|---|
| Model 1 | 29655 | 38327.05 | 0.43 | 394593.2 | -394587.2 |
| Model 2 | 29653 | 37035.48 | 0.54 | 385368.9 | -385358.9 |
| Model 3 | 29639 | 36392.88 | 0.60 | 380666.2 | -380628.2 |
| Model 4 | 29636 | 34917.13 | 0.80 | 368987.6 | -368943.6 |
| Model 5 | 29652 | 37023.40 | 0.54 | 385282.5 | -385270.5 |
| Model 6 | 29638 | 36332.84 | 0.60 | 380225.4 | -380185.4 |
| Model 7 | 29635 | 34842.72 | 0.81 | 368383.2 | -368337.2 |
| Model 8 | 29634 | 34187.59 | 0.94 | 362565.6 | -362517.6 |
| Model 9 | 29633 | 34177.46 | 0.95 | 362472.7 | -362422.7 |
| Model 10 | 29585 | 31961.29 | 1.87 | 338436.4 | -338290.4 |
| Model 11 | 29584 | 31940.07 | 1.88 | 338160.2 | -338012.2 |
| Estimate | Std. Error | z value | Pr(>|z|) | Relative Rates | |
|---|---|---|---|---|---|
| (Intercept) | -83.9575 | 5.0791 | -16.5301 | 0.0000 | 0.00 |
| Age_Groups1-4 | -1.7116 | 0.0343 | -49.8432 | 0.0000 | 0.18 |
| Age_Groups5-9 | -2.1792 | 0.0395 | -55.2057 | 0.0000 | 0.11 |
| Age_Groups10-14 | -1.9433 | 0.0358 | -54.2072 | 0.0000 | 0.14 |
| Age_Groups15-19 | -0.7250 | 0.0289 | -25.0662 | 0.0000 | 0.48 |
| Age_Groups20-24 | -0.1570 | 0.0272 | -5.7687 | 0.0000 | 0.85 |
| Age_Groups25-29 | 0.0695 | 0.0266 | 2.6083 | 0.0091 | 1.07 |
| Age_Groups30-34 | 0.1861 | 0.0263 | 7.0690 | 0.0000 | 1.20 |
| Age_Groups35-39 | 0.3233 | 0.0260 | 12.4187 | 0.0000 | 1.38 |
| Age_Groups40-44 | 0.4668 | 0.0257 | 18.1496 | 0.0000 | 1.59 |
| Age_Groups45-49 | 0.7794 | 0.0252 | 30.8859 | 0.0000 | 2.18 |
| Age_Groups50-54 | 1.1258 | 0.0248 | 45.3247 | 0.0000 | 3.08 |
| Age_Groups55-59 | 1.4469 | 0.0245 | 59.1772 | 0.0000 | 4.25 |
| Age_Groups60-64 | 1.6606 | 0.0243 | 68.2836 | 0.0000 | 5.26 |
| Age_Groups65-69 | 1.7824 | 0.0242 | 73.6662 | 0.0000 | 5.94 |
| Age_Groups70-74 | 1.8682 | 0.0242 | 77.2892 | 0.0000 | 6.48 |
| Age_Groups75-79 | 1.9507 | 0.0242 | 80.6815 | 0.0000 | 7.03 |
| Age_Groups80-84 | 2.0577 | 0.0242 | 85.1017 | 0.0000 | 7.83 |
| RaceAsian or Pacific Islander | -0.3769 | 0.0203 | -18.5608 | 0.0000 | 0.69 |
| RaceBlack or African American | 1.4995 | 0.0187 | 80.3979 | 0.0000 | 4.48 |
| RaceWhite | 2.8806 | 0.0178 | 161.9116 | 0.0000 | 17.82 |
| SexMale | 0.3888 | 0.0087 | 44.8165 | 0.0000 | 1.48 |
| EthnicityNot Hispanic or Latino | 2.2334 | 0.0119 | 187.2333 | 0.0000 | 9.33 |
| Year | 0.0417 | 0.0025 | 16.5794 | 0.0000 | 1.04 |
| StateAlaska | -0.9074 | 0.0507 | -17.8897 | 0.0000 | 0.40 |
| StateArizona | 0.3625 | 0.0407 | 8.9172 | 0.0000 | 1.44 |
| StateArkansas | -0.6982 | 0.0460 | -15.1911 | 0.0000 | 0.50 |
| StateCalifornia | 1.9188 | 0.0388 | 49.4843 | 0.0000 | 6.81 |
| StateColorado | -0.2995 | 0.0424 | -7.0615 | 0.0000 | 0.74 |
| StateConnecticut | -0.8107 | 0.0446 | -18.1629 | 0.0000 | 0.44 |
| StateDelaware | -1.7077 | 0.0508 | -33.6317 | 0.0000 | 0.18 |
| StateFlorida | 1.0673 | 0.0398 | 26.7844 | 0.0000 | 2.91 |
| StateGeorgia | 0.4867 | 0.0419 | 11.6120 | 0.0000 | 1.63 |
| StateHawaii | 0.5450 | 0.0483 | 11.2953 | 0.0000 | 1.72 |
| StateIdaho | -1.4138 | 0.0530 | -26.6877 | 0.0000 | 0.24 |
| StateIllinois | 0.6257 | 0.0417 | 15.0209 | 0.0000 | 1.87 |
| StateIndiana | -0.2253 | 0.0433 | -5.1987 | 0.0000 | 0.80 |
| StateIowa | -1.1653 | 0.0477 | -24.4357 | 0.0000 | 0.31 |
| StateKansas | -1.1252 | 0.0439 | -25.6156 | 0.0000 | 0.32 |
| StateKentucky | -0.4378 | 0.0465 | -9.4204 | 0.0000 | 0.65 |
| StateLouisiana | -0.0567 | 0.0435 | -1.3022 | 0.1928 | 0.94 |
| StateMaine | -1.2720 | 0.0638 | -19.9253 | 0.0000 | 0.28 |
| StateMaryland | 0.0025 | 0.0429 | 0.0574 | 0.9542 | 1.00 |
| StateMassachusetts | -0.3581 | 0.0427 | -8.3789 | 0.0000 | 0.70 |
| StateMichigan | 0.1221 | 0.0414 | 2.9471 | 0.0032 | 1.13 |
| StateMinnesota | -0.7054 | 0.0421 | -16.7632 | 0.0000 | 0.49 |
| StateMississippi | -0.1515 | 0.0473 | -3.2066 | 0.0013 | 0.86 |
| StateMissouri | -0.2451 | 0.0437 | -5.6122 | 0.0000 | 0.78 |
| StateMontana | -1.0990 | 0.0532 | -20.6553 | 0.0000 | 0.33 |
| StateNebraska | -1.5997 | 0.0485 | -32.9612 | 0.0000 | 0.20 |
| StateNevada | -0.5055 | 0.0427 | -11.8427 | 0.0000 | 0.60 |
| StateNew Hampshire | -1.3846 | 0.0641 | -21.5898 | 0.0000 | 0.25 |
| StateNew Jersey | 0.3224 | 0.0420 | 7.6727 | 0.0000 | 1.38 |
| StateNew Mexico | 0.0127 | 0.0443 | 0.2871 | 0.7741 | 1.01 |
| StateNew York | 0.9132 | 0.0398 | 22.9527 | 0.0000 | 2.49 |
| StateNorth Carolina | 0.2324 | 0.0409 | 5.6881 | 0.0000 | 1.26 |
| StateNorth Dakota | -1.6883 | 0.0566 | -29.8230 | 0.0000 | 0.18 |
| StateOhio | 0.3377 | 0.0425 | 7.9365 | 0.0000 | 1.40 |
| StateOklahoma | -0.2144 | 0.0418 | -5.1280 | 0.0000 | 0.81 |
| StateOregon | -0.8456 | 0.0434 | -19.4801 | 0.0000 | 0.43 |
| StatePennsylvania | 0.3950 | 0.0420 | 9.3992 | 0.0000 | 1.48 |
| StateRhode Island | -2.0280 | 0.0523 | -38.7602 | 0.0000 | 0.13 |
| StateSouth Carolina | -0.1325 | 0.0441 | -3.0073 | 0.0026 | 0.88 |
| StateSouth Dakota | -0.9940 | 0.0534 | -18.6271 | 0.0000 | 0.37 |
| StateTennessee | -0.0249 | 0.0433 | -0.5753 | 0.5651 | 0.98 |
| StateTexas | 1.6531 | 0.0402 | 41.1271 | 0.0000 | 5.22 |
| StateUtah | -0.9879 | 0.0466 | -21.1936 | 0.0000 | 0.37 |
| StateVermont | -2.0920 | 0.0668 | -31.3332 | 0.0000 | 0.12 |
| StateVirginia | 0.0695 | 0.0422 | 1.6491 | 0.0991 | 1.07 |
| StateWashington | -0.1205 | 0.0411 | -2.9362 | 0.0033 | 0.89 |
| StateWest Virginia | -1.0149 | 0.0531 | -19.1294 | 0.0000 | 0.36 |
| StateWisconsin | -0.6356 | 0.0423 | -15.0323 | 0.0000 | 0.53 |
| StateWyoming | -2.2503 | 0.0608 | -37.0395 | 0.0000 | 0.11 |
The baselines of each factor are :
Age group: <1
Race: American Indian or Alaska native
Ethnicity: Hispanic or Latino
Sex: Female
State: Alabama
Analysis:
Ethnicity: The risk of dying due to diabetes among Non-Hispanic or Latino is about 9.33 times higher than that for the Hispanic or Latino holding all other variables constant.
Race: Highest risk of death due to diabetes is reported for white Americans. A person of white is 17.82 times more likely to die due to diabetes than American Indian or Alaska Native on average holding all other variables constant.
Sex: Males have about 1.48 times higher rick of dying due to diabetes than females on average holding all other variables constant.
All the predictor variables in this dataset are categorical variables. Predictor variables such as age, gender and biomarkers are used to estimate a patient’s risk of developing a particular disease based on their age, gender, etc. The categorical variables are discrete and allow for easy data collection and summarisation. But using only categorical variables can lead to sparse data, especially if there are many categories or some categories occur infrequently. It may also not fully capture the complexity and nuances of the underlying phenomena. This can make it difficult to identify meaningful relationships or patterns in the data and reduce the accuracy of predictive models.
If it is possible to continue to do similar studies in the future, I would continue to add additional relevant variables in terms of the receipt data, such as including other clinical characteristics, or some laboratory values. At the same time I would consider removing less important variables to improve the accuracy of the model. Diabetes is a chronic metabolic disease that is influenced by genetic factors, environmental factors and lifestyle habits. If data on lifestyle habits could be collected, it would also help to predict the onset of diabetes and mortality more accurately in other ways.
---
title: "Diabetes Mortality"
output:
flexdashboard::flex_dashboard:
theme:
bootswatch: zephyr
primary: "#96ACA3"
orientation: columns
vertical_layout: fill
social: menu
source_code: embed
---
<style type="text/css">
.chart-title { /* chart_title */
font-size: 18px;
}
body{ /* Normal */
font_size: 16px
}
</style>
```{r setup, include=FALSE}
pacman:: p_load(DT, flexdashboard, gridExtra, knitr, MASS, plotly, rpart, rpart.plot, tidyverse, usmap,dplyr,MASS, pacman, ggplot2, faraway)
```
```{r data cleaning, eval=FALSE}
#----------read file and combine them vertically by year.-----------
path = "../Data"
allpopfiles <- list.files(path, pattern = ".txt", full.names = TRUE)
diabetes <- data.frame()
for (i in 1:length(allpopfiles)){
temp <- read_tsv(allpopfiles[i])
temp <- temp %>%
mutate(Year = str_extract(allpopfiles[i], "20.."))
diabetes <- rbind.data.frame(diabetes, temp)
}
```
```{r read_data}
diabetes <- read_csv("../Data/diabetes.csv")
#------------create a new variables "region"------------------------
Regions <- cbind.data.frame(state.name, state.region)
Northeast <- Regions$state.name[Regions$state.region=="Northeast"]
South <- Regions$state.name[Regions$state.region=="South"]
North_Central <- Regions$state.name[Regions$state.region=="North Central"]
West <- Regions$state.name[Regions$state.region=="West"]
diabetes <- diabetes %>%
rename(Age_Groups = `Five-Year Age Groups`,
Ethnicity =`Hispanic Origin`,
Crude_Rate = `Crude Rate`,
Sex = Gender) %>%
dplyr::select(-c(Notes,`State Code`,`Gender Code`,`Race Code`,`Hispanic Origin Code`,`Five-Year Age Groups Code`)) %>%
mutate(Region = case_when(
State %in% Northeast ~ "Northeast",
State %in% South ~ "South",
State %in% North_Central ~ "North_Central",
State %in% West ~ "West"))
all_ages_groups <- c("< 1", "1-4", "5-9", "10-14", "15-19",
"20-24", "25-29", "30-34", "35-39", "40-44",
"45-49", "50-54", "55-59", "60-64", "65-69",
"70-74", "75-79", "80-84")
diabetes$Age_Groups <- factor(gsub(" years| year", "",
diabetes$Age_Groups),
levels = all_ages_groups)
#-------------create a new variable "Death_Rate" per 10,0000---------
diabetes <- diabetes %>% filter(Population != "Not Applicable")
diabetes$Population <- as.numeric(diabetes$Population)
diabetes <- mutate(diabetes,Mortality_rate = round(Deaths/Population * 100000,2)) %>%
dplyr::select(-Crude_Rate)
#--------------Data Cleaning------------------
diabetes[sapply(diabetes, is.character)] <- lapply(diabetes[sapply(diabetes, is.character)], as.factor)
diabetes <- diabetes %>% filter(State != "District of Columbia")
diabetes$Year <- as.numeric(diabetes$Year)
```
Introduction
======================================================================
Column {data-width=450}
----------------------------------------------------------------------
### Motivation
<font size=3>
**Predictive Modelling Of Mortality Rates for Diabetes: An Analysis Of Risk Factors**
</font>
Diabetes is a chronic disease affecting millions globally and a leading cause of death. Understanding mortality trends and predicting future mortality rates is crucial for public health planning and intervention strategies. Our objective is to identify critical factors that contribute to mortality from diabetes and develop a suitable model for predicting future mortality rates. The outcomes of this study could inform policies and interventions aimed at reducing the burden of diabetes-related mortality.
Resource: [CDC WONDER](https://wonder.cdc.gov/ucd-icd10.html)
### An overlook of the dataset
This project analyzes mortality rates for diabetes in the United States from 2015 to 2020 and employs various predictive methods in Statistics and Machine Learning to forecast future mortality rates.
Variables:
- Locations: States, Regions
- Gender : Female, Male
- Age: Five-Year age group from age 0-84
- Race: White, Black or African American, American Indian or Alaska Native
- Ethnicity: Hispanic or Latino, Not Hispanic or Latino
- Year: From 2015 - 2020
Column {data-width=550}
-----------------------------------------------------------------------
### Glimpse
```{r}
datatable(diabetes, rownames = FALSE,
colnames = c("State", "Sex", "Age Group", "Race", "Ethnicity", "Death", "Population", "Year", "Region", "Mortality (Per 100,000)"),
class = "hover",
options = list(
columnDefs = list(list(className = 'dt-center',
targets = 1:5)),
pageLength = 10,
initComplete = JS("function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#96ACA3', 'color': '#fff'});","}")
))
```
Mortality
======================================================================
```{r, fig.align='center'}
diabetes_state <- diabetes %>%
group_by(State) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality_rate = round(Deaths/Population*100000,2)) %>%
right_join(Regions,by=c("State" = "state.name")) %>%
rename(Region = state.region)
US_map <- us_map("state") %>% filter(full != "District of Columbia")
rate_data <- diabetes_state %>%
left_join(US_map, by = c("State"="full"))
label <- rate_data %>%
group_by(abbr) %>%
summarise(x = mean(x), y = mean(y))
p <- ggplot(rate_data, aes(x = x, y = y,color = "black")) +
geom_polygon(aes(group = group, fill = Mortality_rate,
text = paste0(State,":\n", Mortality_rate, " deaths per 100,000")),
colour = "black") +
geom_text(aes(label = abbr),data = label,fontface = "bold",color="white") +
scale_fill_continuous(low = "#F6F5E9", high = "#6F6360",name = "Mortality rate") +
theme_minimal() +
theme(axis.title.x=element_blank(), axis.text.x=element_blank(),
axis.ticks.x=element_blank(), axis.title.y=element_blank(),
axis.text.y =element_blank(), axis.ticks.y=element_blank(),
panel.grid.major = element_blank(),
panel.background = element_blank())
ggplotly(p, tooltip = "text", width = 1100, height = 700)
```
Data Exploration
===
Column {.tabset data-width=700}
-----------------------------------------------------------------------
### Age
```{r}
diabetes_age <- diabetes %>%
group_by(Age_Groups) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality = round(Deaths/Population*100000,2))
a <- ggplot(diabetes_age,aes(y=reorder(Age_Groups, Mortality),x = Mortality)) +
geom_col(fill = "#A28A8F",aes(text = paste0("Mortality: ", Mortality, " deaths per 100,000"))) +
xlab("Mortality Rate Per 100,000 People") +
ylab("Age Group (Years)") +
theme(axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12))
ggplotly(a,tooltip = "text")
```
### Ethnicity
```{r}
diabetes_Ethnicity <- diabetes %>%
group_by(Ethnicity) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality = round(Deaths/Population*100000,2))
h <- ggplot(diabetes_Ethnicity,aes(x=Ethnicity,y=Mortality)) +
geom_col(fill = "#73564C") +
ylab("Mortality Rate Per 100,000 People") +
xlab("Ethnicity")+
theme(axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12))
ggplotly(h)
```
### Race
```{r}
diabetes_race <- diabetes %>%
group_by(Race) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality= round(Deaths/Population*100000,2))
r <- ggplot(diabetes_race,aes(y=reorder(Race,Mortality),x=Mortality)) +
geom_col(fill = "#98A28A",aes(text = paste0("Mortality: ", Mortality, " deaths per 100,000")))+
xlab("Mortality Rate Per 100,000 People") +
ylab("Race") +
theme(axis.text.y = element_text(size = 12),
axis.text.x = element_text(size = 12))
ggplotly(r)
```
### Sex
```{r}
diabetes_sex <- diabetes %>%
group_by(Sex) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality = round(Deaths/Population*100000,2))
s <- ggplot(diabetes_sex,aes(x=Sex,y=Mortality)) +
geom_col(fill = "#BF9568")+
ylab("Mortatility Rate Per 100,000 People")+
theme(axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12))
ggplotly(s)
```
### Year
```{r}
diabetes_year <- diabetes %>%
group_by(Year) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality = round(Deaths/Population*100000,2))
y <- (ggplot(diabetes_year,aes(x=Year,y=Mortality))) +
geom_col(fill = "#316C6C")+
ylab("Mortatility Rate Per 100,000 People")+
theme(axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12))
ggplotly(y)
```
### Region
```{r}
## mortality rate by region
diabetes_region <- diabetes %>%
group_by(Region) %>%
summarise(Mortality = round(sum(Deaths)/sum(Population)*100000, 2))
r <- diabetes_region %>%
ggplot(aes(x = Region, y = Mortality)) +
geom_bar(stat="identity", fill="#637493") +
ylab("Mortatility Rate Per 100,000 People") +
geom_polygon(aes(text = paste0("Mortality rate: ", Mortality, " deaths per 100,000")), colour = "black")+
theme(axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12))
ggplotly(r)
```
### State
```{r}
diabetes_state <- diabetes %>%
group_by(State) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality= round(Deaths/Population*100000, 2)) %>%
right_join(Regions,by=c("State" = "state.name")) %>%
rename(Region = state.region)
q <- ggplot(diabetes_state, aes(y=reorder(State,Mortality),
x=Mortality,fill = Region)) +
geom_col(aes(text = paste0("Mortality: ", Mortality, " deaths per 100,000"))) +
ggtitle("Mortality rate of 6 years in each state")+
ylab("States")+
xlab("Mortality Rate")+
geom_polygon(aes(text = paste0("Mortality: ", Mortality, " deaths per 100,000")), colour = "black")
ggplotly(q,tooltip = "text")
```
### Panel Graph
```{r}
diabetes_Year <- diabetes %>%
group_by(Year, Sex, Race, Ethnicity) %>%
summarise(Mortality = round(sum(Deaths)/sum(Population)*100000, 2))
p_year <- diabetes_Year %>%
ggplot(aes(x=Mortality, y=Race, fill=Sex)) +
geom_bar(stat = "identity",
aes(text = paste0(Sex, Race, ":\n",
"Mortality Rate: ", Mortality, " Per 100,000 People."))) +
facet_grid(Ethnicity~Year) +
xlab("Mortatility rate Per 100,000 People") +
scale_fill_manual(values = c("Male" = "#96ACA3",
"Female" = "#F5A997")) +
theme(text = element_text(size = 12), legend.position = "none") +
theme(axis.text.y = element_text(size = 8))
ggplotly(p_year, tooltip = "text")
```
### Summary
According to the age distribution, the older the age, the higher the mortality rate of diabetics. the greatest mortality rate was recorded in the 80-84 age group, reaching 5963 per 100,000. More surprisingly, infants younger than one year of age also have a high mortality rate, ranking seventh out of 18 age groups, at 570 per 100,000. The lowest mortality rate was in the 5-9 age group, with a rate of 13 per 100,000.
In terms of ethnic distribution, the death rate for non-Hispanics and Latinos is higher than that for Hispanics, almost twice as high. American Indians or Alaska Natives have the highest mortality rate, with a rate of approximately 1061 per 100,000. Asian and Pacific Islanders have the lowest mortality rate, with a rate of approximately 364 per 100,000. By gender distribution, males have a higher mortality rate than females.
The geographical distribution is clear. According to the regional map, the North Central and Southern regions have the highest mortality rates, even almost identical, at around 729, but when it comes to mortality rates per state, seven of the top ten states are from the Southern region, while the top six states are all from the South.
Since 2015, through 2019, the diabetes death rate has steadily increased in each year based on trends, but not by much. But the mortality rate in 2020 is clearly and significantly higher, and this reason is supposedly related to the outbreak of corona virus in March 2020.
Modeling
===
Column {.tabset data-width=350}
---------------------------------------------------------------------
### Negative Binomial Regression Models
Negative binomial regression is used to predict the count of deaths instead of Poisson regression. As checked, in this dataset, there is a huge gap between the mean and variance of the number of deaths.
Negative binomial regression assumes that the count data follows a negative binomial distribution, and the predictor variables have a linear relationship with the logarithm of the expected count.
*Link Function:*
$\log{\mbox{deaths}} = \mbox{Intercept} + b_1x_1+b_2x_2+\cdot + b_kx_k$
### Models Considered
```{r allmodels}
Models <- paste("Model", 1:11)
Predictors <- c("Sex", "Race", "Age Group", "Age Group, Race",
"Race, Sex", "Age Group, Sex", "Age Group, Race, Sex",
"Age Group, Race, Sex, Ethnicity",
"Age Group, Race, Sex, Ethnicity, Year",
"Age Group, Race, Sex, Ethnicity, State",
"Age Group, Race, Sex, Ethnicity, Year, State"
)
all_models <- data.frame(Models = Models, Predictors = Predictors)
kable(all_models)
```
Column {data-width=650}
----------------------------------------------------------------------
### Model Comparison
```{r,include=FALSE}
# + offset(log(Population/100000)), link=log,
#Model 1
fit.sex = glm.nb(Deaths ~ Sex, data = diabetes)
p.val1 = pchisq(deviance(fit.sex), df.residual(fit.sex), lower = F)
p.val1
m1 <- round(c(df.residual(fit.sex), deviance(fit.sex), fit.sex$theta,
fit.sex$aic, fit.sex$twologlik),2)
#Model 2
fit.race = glm.nb(Deaths ~ Race,data = diabetes)
p.val2 = pchisq(deviance(fit.race), df.residual(fit.race), lower = F)
p.val2
m2 <- round(c(df.residual(fit.race), deviance(fit.race), fit.race$theta,
fit.race$aic, fit.race$twologlik),2)
#Model 3
fit.age = glm.nb(Deaths ~ Age_Groups, data = diabetes )
p.val3 = pchisq(deviance(fit.age), df.residual(fit.age), lower = F)
p.val3
m3 <- round(c(df.residual(fit.age), deviance(fit.age), fit.age$theta,
fit.age$aic, fit.age$twologlik),2)
#Model 4
fit.age.race = glm.nb(Deaths ~ Age_Groups + Race, data = diabetes)
p.val4 = pchisq(deviance(fit.age.race), df.residual(fit.age.race), lower = F)
p.val4
m4 <- round(c(df.residual(fit.age.race), deviance(fit.age.race), fit.age.race$theta,
fit.age.race$aic, fit.age.race$twologlik),2)
#Model 5
fit.sex.race = glm.nb(Deaths ~ Race + Sex, data = diabetes)
p.val5 = pchisq(deviance(fit.sex.race), df.residual(fit.sex.race), lower = F)
p.val5
m5 <- round(c(df.residual(fit.sex.race), deviance(fit.sex.race), fit.sex.race$theta,
fit.sex.race$aic, fit.sex.race$twologlik),2)
#Model 6
fit.sex.age = glm.nb(Deaths ~ Age_Groups + Sex, data = diabetes)
p.val6 = pchisq(deviance(fit.sex.age), df.residual(fit.sex.age), lower = F)
p.val6
m6 <- round(c(df.residual(fit.sex.age), deviance(fit.sex.age), fit.sex.age$theta,
fit.sex.age$aic, fit.sex.age$twologlik),2)
#Model 7
fit.sex.age.race = glm.nb(Deaths ~ Age_Groups + Race + Sex, data = diabetes)
p.val7 = pchisq(deviance(fit.sex.age.race), df.residual(fit.sex.age.race), lower = F)
p.val7
m7 <- round(c(df.residual(fit.sex.age.race), deviance(fit.sex.age.race), fit.sex.age.race$theta,
fit.sex.age.race$aic, fit.sex.age.race$twologlik),2)
#Model 8
fit.sex.age.race.his = glm.nb(Deaths ~ Age_Groups + Race + Sex + Ethnicity,data = diabetes)
p.val8 = pchisq(deviance(fit.sex.age.race.his), df.residual(fit.sex.age.race.his), lower = F)
p.val8
m8 <- round(c(df.residual(fit.sex.age.race.his), deviance(fit.sex.age.race.his), fit.sex.age.race.his$theta,
fit.sex.age.race.his$aic, fit.sex.age.race.his$twologlik),2)
#Model 9
fit.sex.age.race.his.year = glm.nb(Deaths ~ Age_Groups + Race + Sex + Ethnicity + Year, data = diabetes)
p.val9= pchisq(deviance(fit.sex.age.race.his.year), df.residual(fit.sex.age.race.his.year), lower = F)
p.val9
m9 <- round(c(df.residual(fit.sex.age.race.his.year), deviance(fit.sex.age.race.his.year), fit.sex.age.race.his.year$theta,
fit.sex.age.race.his.year$aic, fit.sex.age.race.his.year$twologlik),2)
#Model 10
fit.sex.age.race.his.State = glm.nb(Deaths ~ Age_Groups+ Race + Sex+ Ethnicity + State, data = diabetes)
p.val10= pchisq(deviance(fit.sex.age.race.his.State), df.residual(fit.sex.age.race.his.State), lower = F)
p.val10
m10 <- round(c(df.residual(fit.sex.age.race.his.State), deviance(fit.sex.age.race.his.State), fit.sex.age.race.his.State$theta,
fit.sex.age.race.his.State$aic, fit.sex.age.race.his.State$twologlik),2)
#Model 11
fit.sex.age.race.his.year.State = glm.nb(Deaths ~ Age_Groups +
Race+ Sex + Ethnicity+ Year + State, data = diabetes)
p.val11= pchisq(deviance(fit.sex.age.race.his.year.State), df.residual(fit.sex.age.race.his.year.State), lower = F)
p.val11
m11 <- round(c(df.residual(fit.sex.age.race.his.year.State), deviance(fit.sex.age.race.his.year.State), fit.sex.age.race.his.year.State$theta,
fit.sex.age.race.his.year.State$aic, fit.sex.age.race.his.year.State$twologlik),2)
```
```{r}
################# SUMMARY Table-2 ####################
tab2 <- rbind(m1, m2, m3, m4, m5, m6, m7, m8, m9, m10,m11)
colnames(tab2) = c( "d.f.", "dev", "disp", "AIC", "2Llik")
rownames(tab2) = c("Model 1", "Model 2", "Model 3","Model 4","Model 5","Model 6","Model 7","Model 8","Model 9","Model 10","Model 11")
knitr::kable(tab2, digits=3)
```
Diagnosis
===
Column {.tabset data-wigth=1000}
----------------------------------------------------------------------
### Model 1
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r, fig.align = "center"}
halfnorm(residuals(fit.sex), main = " Model 1: Sex")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r, fig.align = "center"}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 1: Sex")
lines(density(predict(fit.sex, type="response")), col = "red")
```
:::
::::::
### Model 2
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.race),
main = "Model 2: Race")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 2: Race")
lines(density(predict(fit.race, type="response")), col = "red")
```
:::
::::::
### Model 3
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.age),
main = "Model 3: Age Group")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 3: Age Group")
lines(density(predict(fit.race, type="response")), col = "red")
```
:::
::::::
### Model 4
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.age.race),
main = "Model 4: Age Group + Race")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 4: Age Group + Race")
lines(density(predict(fit.age.race, type="response")), col = "red")
```
:::
::::::
### Model 5
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.race),
main = "Model 5: Race + Sex")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 5: Race + Sex")
lines(density(predict(fit.sex.race, type="response")), col = "red")
```
:::
::::::
### Model 6
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age),
main = "Model 66: Age Group + sex")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 6: Age Group + Sex")
lines(density(predict(fit.sex.age, type="response")), col = "red")
```
:::
::::::
### Model 7
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age.race),
main = "Model 7: Age Group + Race + Sex")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 7: Age Group + Race + Sex")
lines(density(predict(fit.sex.age.race, type="response")), col = "red")
```
:::
::::::
### Model 8
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age.race.his),
main = "Age Group + Race + Sex + Ethnicity")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 8: Age Group")
lines(density(predict(fit.sex.age.race.his, type="response")), col = "red")
```
:::
::::::
### Model 9
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age.race.his.year),
main = "Age Group + Race + Sex + Ethnicity + Year")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 9: Age Group + Race + Sex + Ethnicity + Year")
lines(density(predict(fit.sex.age.race.his.year, type="response")), col = "red")
```
:::
::::::
### Model 10
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age.race.his.State),
main = "Age Group + Race + Sex + Ethnicity + State")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 10: Age Group + Race + Sex + Ethnicity + State")
lines(density(predict(fit.sex.age.race.his.State, type="response")), col = "red")
```
:::
::::::
### Model 11
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age.race.his.year.State),
main = "Age Group + Race + Sex + Ethnicity + Year + State")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 11: Age Group + Race + Sex + Ethnicity + Year + State")
lines(density(predict(fit.sex.age.race.his.year.State, type="response")), col = "red")
```
:::
::::::
Discussion
===
Column {data-width=700}
-----------------------------------------------------------------------
### Result
```{r}
m11_result <- summary(fit.sex.age.race.his.year.State)
m11_result_RR <- round(m11_result[[12]], 4)
m11_result_RR <- cbind(m11_result_RR, round(exp(m11_result_RR[,1]), 2))
colnames(m11_result_RR)[5] <- "Relative Rates"
#DT::datatable(m11_result_RR)
knitr::kable(m11_result_RR)
```
Column {.tabset}
-----------------------------------------------------------------------
### Analysis
The baselines of each factor are :
- Age group: <1
- Race: American Indian or Alaska native
- Ethnicity: Hispanic or Latino
- Sex: Female
- State: Alabama
<font size = 3>
**Analysis:**
</font>
- Ethnicity: The risk of dying due to diabetes among Non-Hispanic or Latino is about 9.33 times higher than that for the Hispanic or Latino holding all other variables constant.
- Race: Highest risk of death due to diabetes is reported for white Americans. A person of white is 17.82 times more likely to die due to diabetes than American Indian or Alaska Native on average holding all other variables constant.
- Sex: Males have about 1.48 times higher rick of dying due to diabetes than females on average holding all other variables constant.
### Limitation
All the predictor variables in this dataset are categorical variables. Predictor variables such as age, gender and biomarkers are used to estimate a patient's risk of developing a particular disease based on their age, gender, etc. The categorical variables are discrete and allow for easy data collection and summarisation. But using only categorical variables can lead to sparse data, especially if there are many categories or some categories occur infrequently. It may also not fully capture the complexity and nuances of the underlying phenomena. This can make it difficult to identify meaningful relationships or patterns in the data and reduce the accuracy of predictive models.
If it is possible to continue to do similar studies in the future, I would continue to add additional relevant variables in terms of the receipt data, such as including other clinical characteristics, or some laboratory values. At the same time I would consider removing less important variables to improve the accuracy of the model. Diabetes is a chronic metabolic disease that is influenced by genetic factors, environmental factors and lifestyle habits. If data on lifestyle habits could be collected, it would also help to predict the onset of diabetes and mortality more accurately in other ways.